Contributions

During the lab, Lakshidaa focused on assignment 1 and Hariprasath focused on assignment 2. After the lab, both of us collaborated to complete the remaining tasks and to clarify each other’s doubts. Then we discussed together and answered the questions related to analysis and interpretation of the plots.

Assignment 1 : Perception in Visualization

olive_data <- read.csv("olive.csv")

Task 1.1

First, we make a scatterplot that shows dependence of Palmitic on Oleic with observations colored by Linolenic.

ggplot(olive_data) + geom_point(aes(x = oleic, y = palmitic, color = linolenic)) +
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5))

Second, we make a scatter plot that shows dependence of Palmitic on Oleic with observations colored by Linolenic variable which is divided into four discrete classes.

ggplot(olive_data) + 
  geom_point(aes(x = oleic, y = palmitic, 
                 color=cut_interval(olive_data$linolenic, n = 4))) +
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color = 'Linolenic range') 

The first plot has its observations colored continuously by Linolenic values whereas the second one has its observations marked based on four classes of Linolenic data. It would be hard to analyse the intermediate values of data in both the plots. In the first plot, it is difficult to precisely identify the individual data points. In the second plot, viewer can easily identify the class and the data range of a data point but cannot identify the value.

Viewers could correctly classify hues with upto 10 levels (3.1 bits) and brightness upto 5 levels (2.3 bits). In the first plot, there are a large number of brightness levels and it will be very difficult for viewers to correctly identify values. In the second plot, there are only 4 hue levels and hence, it will be easy for viewers to correctly classify the data points. This is the perception problem that is observed in this experiment. Also, extreme outliers in the variable mapped to color could lead to most of the points having a similar color.

Task 1.2

Task 1.2.a

First, we make a scatterplot of Palmitic vs Oleic mapping the discretized Linolenic with four classes to color.

ggplot(olive_data) + 
  geom_point(aes(x = oleic, y = palmitic, color = cut_interval(linolenic, n = 4)))+
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color = 'Linolenic range') 

Task 1.2.b

Second, we make a scatterplot of Palmitic vs Oleic mapping the discretized Linolenic with four classes to size.

ggplot(olive_data) + geom_point(aes(x = oleic, y = palmitic, size = cut_interval(linolenic, n = 4))) + 
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_size_manual(name = "Linolenic range", values = c(1, 2, 3, 4))

Task 1.2.c

Third, we make a scatterplot of Palmitic vs Oleic mapping the discretized Linolenic with four classes to orientation angle.

# Pre-processing - Setting angle values based on category
olive_data$linolenic_class <- cut_interval(olive_data$linolenic, n = 4)
levels(olive_data$linolenic_class) <- (0:3) * (pi/4)
olive_data$linolenic_class <- as.numeric(as.character(olive_data$linolenic_class))

ggplot(olive_data, aes(x=oleic, y=palmitic)) + geom_point() +  
  geom_spoke(aes(angle = olive_data$linolenic_class), radius=40) + 
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic 
Legend
Orientation angle of spoke : Linolenic class
0:(0,18.5], 45:(18.5,37], 90:(37,55.5], 135:(0,18.5] ") +
  theme(plot.title = element_text(hjust = 0.5)) 

After observing the above 3 plots, it can be observed that the plots using size and orientation angle to represent the classes of Linolenic are the ones where it is more difficult to differentiate between categories.

In the case of mapping the discretized Linolenic variable to size, there is a lot of overlap of data points which cannot be differentiated properly. Generally, it should be possible to identify 4-5 levels of size (2.2 bits) but in this case, because of the occlusion problem we are not able to perceive the differences in categories.

In the case of mapping the discretized Linolenic with four classes to orientation angle, the observations are at various angles depending on the class of Linolenic they belong to (0/45/90/135 degrees). As stated in the literature, when there is a single sloping line within multiple straight lines, we are easily able to detect it by perception. But, if there is a single straight line within multiple sloping lines, our perception finds it harder to detect this straight line. Similarly, in this plot, there are multiple straight and sloping lines which overlap and make it difficult to identify decision boundaries. The perception metrics in this case would be 6-10 levels (2.8 to 3.3 bits). However, when used in a crowded scatterplot, it is difficult to differentiate them.

Task 1.3

First, we make a scatterplot of Oleic vs Eicosenoic where color is defined by numeric values of Region.

ggplot(olive_data)+
  geom_point(aes(x=eicosenoic,y=oleic,color=Region)) +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Region") +
  theme(plot.title = element_text(hjust = 0.5))

Second, we make a scatterplot of Oleic vs Eicosenoic where color is defined by categorical Region variable.

olive_data$Region_category <- "Sardinia Island"
olive_data[olive_data$Region == 1, c("Region_category")] <- "North"
olive_data[olive_data$Region == 2, c("Region_category")] <- "South"

ggplot(olive_data)+
  geom_point(aes(x=eicosenoic,y=oleic,color=Region_category)) +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Region") +
  theme(plot.title = element_text(hjust = 0.5))

In the first plot, the region is used as a numeric variable and mapped to color (brightness). So, there is an inherent relationship between the values of Region (ie. Region=3 > Region=2 > Region=1) in the color scale. But, in reality no such relationship exists as it is a categorical variable. This is the problem with the first plot.

In the case of the second plot, where Region is taken as a categorical variable and mapped to color (hues), the decision boundaries are clearly visible and we are able to identify them very quickly. Preattentive mechanism makes this possible. Hence, the 2nd plot can be more easily interpreted than the first plot.

Task 1.4

We make a scatterplot of Oleic vs Eicosenoic where color is defined by discretized Linoleic variable (3 classes), shape is defined by discretized Palmitic variable (3 classes) and size is defined by discretized Palmitoleic variable (3 classes).

ggplot(olive_data)+
  geom_point(aes(x = oleic, y = eicosenoic, color = cut_interval(linoleic, n = 3),
                            shape = cut_interval(palmitic, n = 3),
                            size = cut_interval(palmitoleic, n = 3))) + 
  scale_size_manual(values = c(2,3,4)) +
  labs(shape = "Palmitic range", color = "Linoleic range", size = "Palmitoleic range") +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Linoleic vs Palmitic vs Palmitoleic") +
  theme(plot.title = element_text(hjust = 0.5))

With the use of 3 levels of mapping, 27 different types of markers are present in the above plot. It is very difficult to differentiate between each of these observations. Our channel capacity is limited to 10 levels of hue (3.1 bits) and 5 levels of brightness (2.3 bits), 4-5 levels of size (2.2 bits). On average, we are limited to 6-7 levels of different observations (2.6 bits). When using multiple mappings at once, our channel capacity does not increase linearly as a sum of their individual channel capacities. For example, when size, brightness and hue are used together, we have a channel capacity of 4.1 bits while sum of the channel capacities is 7.6 bits. Because of this, we cannot detect things in a preattentive manner and makes it difficult to easily interpret the plot.

Task 1.5

We make a scatterplot of Oleic vs Eicosenoic where color is defined by Region, shape is defined by a discretized Palmitic (3 classes) and size is defined by a discretized Palmitoleic (3 classes).

ggplot(olive_data)+
  geom_point(aes(x=eicosenoic, y=oleic, color=Region_category, 
                 shape=cut_interval(palmitic,n=3), 
                 size=cut_interval(palmitoleic,n=3))) + 
  scale_size_manual(values = c(2,3,4)) +
  labs(shape="Palmitic range", size="Palmitoleic range") +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Region vs Palmitic vs Palmitoleic") +
  theme(plot.title = element_text(hjust = 0.5))

It is possible to see a clear decision boundary between Regions despite so many aesthetics being used. According to Triesman’s theory, our visual system spilts each set of features into separate maps and processes them in parallel and also, a significant target-nontarget feature difference would allow individual feature maps to ignore non-target information contained in the master map. This is clearly applicable here. The Region variable denoted by color has clear decision boundaries and this can be quickly detected from the feature map corresponding to color. So, we do not get distracted by the other featues in the master map. Hence, we are able to identify the clusters based on Region eventhough there are various other features in the plot.

Task 1.6

We make a pie chart that shows the proportions of oils coming from different Areas.

plot_ly(olive_data,labels=~Area,type='pie',textinfo = "none") %>%
  layout(title = "Pie chart of proportion of oils coming from different areas")

In the above pie chart, we can clearly see that the highest proportion of oils come from South-Apulia. But, the other pies in the chart are very similar in size and difficult to differentiate which one in higher/lower without the percentage labels. Angle and area were found to be more error prone compared to position on a common scale and length in terms of perceiving relative differences by William Cleveland and his colleagues. Due to the relative differences in areas/angles being very small, the viewer could easily make errors while interpreting the pie chart.

Task 1.7

2d-density contour plot which shows the dependence of Linoleic vs Eicosenoic.

ggplot(olive_data)+geom_density_2d(aes(x=eicosenoic, y=linoleic)) +
  ggtitle("Contour plot of Linoleic vs Eicosenoic") +
  theme(plot.title = element_text(hjust = 0.5))

When compared to the scatterplot, in the contour plot, the contour lines look like there is some overlap between the clusters on the left and clusters on the right. But, we can see in the scatterplot that the clusters are clearly separated. In this way, the contour plot is misleading in the blank spaces between the clusters. This is probably caused by the interpolation used while plotting the contour lines.

Assignment 2

Task 2.1

baseball_data = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1")

The variables in the baseball data have very different data ranges, like OBP which varies from 0.2990 to 0.3480 and AB which varies from 5330 to 5670. Multidimensional scaling uses distances between data records as a metric in the algorithm. Having variables with such different ranges would lead to distances being heavily skewed by the variables with higher data values. So, it is reasonable to scale the variables before performing Multidimensional Scaling (MDS).

Task 2.2

plot_ly(baseball_data_with_mds, x=~MDS_V1, y=~MDS_V2) %>% 
  add_markers(color=~League, colors = c("blue", "red"), 
              text=baseball_data_with_mds$Team, hoverinfo="text") %>%
  layout(title="Scatterplot of the 2 MDS variables")

From the scatterplot, we see that there is some difference between the 2 leagues. More teams from NL league are at the bottom of the plot as compared to teams from the AL league. Most of the teams from the AL league are concentrated in the central portion in the upper part of the plot (MDS_V1 > -3, MDS_V1 < 4, MDS_V2 > -1.5 ). Among the 2 MDS variables, MDS_V2 appears to differentiate the 2 leagues the best since it is possible to observe a boundary around MDS_V2 = -1. It is difficult to observe such a boundary using the MDS_V1 variable.

Outliers: Based on the plot, Boston Red Sox seems like an outlier. While there are other teams also which lie in the other extremes, they are not isolated and hence, do not appear like outliers.

Task 2.3

plot_ly()%>%
  add_markers(x=~delta, y=~D, name="Observation pairs", hoverinfo = 'text', 
              text = ~paste('Obj 1: ', 
                            rownames(baseball_data_with_mds)[index1], 
                            '<br> Obj 2: ', 
                            rownames(baseball_data_with_mds)[index2])) %>%
  add_lines(x=~shp$x, y=~shp$yf, name="Isotonic regression line") %>%
  layout(title="Shepard's plot of MDS operation")

The mapping line in the Shepard’s plot looks almost linear and apart from a few points, most of them are crowded close to this line. So, it seems like the stress involved in the MDS operation must have been relatively low. So, it appears like the MDS operation was successful.

While mapping observations in Non-metric MDS, if the points lie on a straight line, then it is easiest to map the observations. And those points which are farthest from the regression line of D and delta are the hardest to map. Based on this logic, we can see that observation pairs - 17&1, 20&16. 2&1 were the most difficult to map as they lie farthest from the regression line.

Task 2.4

MDS_V2 is selected as the variable which best differentiates the leagues among the two MDS variables. By creating scatterplots of MDS_V2 with other numeric variables, it was observed that HR (Home runs) and 3B (Triples) have the strongest connection/correlation with the chosen MDS variable (MDS_V2).

plot_ly(baseball_data_with_mds) %>% add_markers(x=~MDS_V2, y=~HR) %>%
  layout(title="Scatterplot of Home Runs vs MDS_V2", yaxis=list(title="Home Runs"))
plot_ly(baseball_data_with_mds) %>% add_markers(x=~MDS_V2, y=~X3B) %>%
  layout(title="Scatterplot of Triples vs MDS_V2", yaxis=list(title="Triples"))

According to Wikipedia, a home run (HR) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely in one play without any errors being committed by the defensive team in the process. According to Wikipedia, a triple (3B) is the act of a batter safely reaching third base after hitting the ball, with neither the benefit of a fielder’s misplay nor another runner being put out on a fielder’s choice. Both the variables are very important in scoring baseball teams as they are the most important way for a batting team to score runs. The MDS_V2 variable is positively correlated with Home runs and negatively correlated with Triples. Both the variables depict the batting characteristics of a team. So, MDS_V2 variable represents the batting characteristics of a team and probably denotes the ability of teams to score more home runs compared to triples.

Appendix

knitr::opts_chunk$set(echo = TRUE)
# Loading required R packages
library(xlsx)
library(ggplot2)
library(plotly)
library(MASS)
library(shiny)

olive_data <- read.csv("olive.csv")

# TASK 1.1

ggplot(olive_data) + geom_point(aes(x = oleic, y = palmitic, color = linolenic)) +
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5))


ggplot(olive_data) + 
  geom_point(aes(x = oleic, y = palmitic, 
                 color=cut_interval(olive_data$linolenic, n = 4))) +
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color = 'Linolenic range') 

#------------------------------------------------------------------------------

# TASK 1.2

ggplot(olive_data) + 
  geom_point(aes(x = oleic, y = palmitic, color = cut_interval(linolenic, n = 4)))+
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(color = 'Linolenic range') 


ggplot(olive_data) + geom_point(aes(x = oleic, y = palmitic, size = cut_interval(linolenic, n = 4))) + 
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_size_manual(name = "Linolenic range", values = c(1, 2, 3, 4))


# Pre-processing - Setting angle values based on category
olive_data$linolenic_class <- cut_interval(olive_data$linolenic, n = 4)
levels(olive_data$linolenic_class) <- (0:3) * (pi/4)
olive_data$linolenic_class <- as.numeric(as.character(olive_data$linolenic_class))

ggplot(olive_data, aes(x=oleic, y=palmitic)) + geom_point() +  
  geom_spoke(aes(angle = olive_data$linolenic_class), radius=40) + 
  ggtitle("Dependence of Palmitic vs Oleic vs Linolenic 
Legend
Orientation angle of spoke : Linolenic class
0:(0,18.5], 45:(18.5,37], 90:(37,55.5], 135:(0,18.5] ") +
  theme(plot.title = element_text(hjust = 0.5)) 

#------------------------------------------------------------------------------

# TASK 1.3

ggplot(olive_data)+
  geom_point(aes(x=eicosenoic,y=oleic,color=Region)) +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Region") +
  theme(plot.title = element_text(hjust = 0.5))


olive_data$Region_category <- "Sardinia Island"
olive_data[olive_data$Region == 1, c("Region_category")] <- "North"
olive_data[olive_data$Region == 2, c("Region_category")] <- "South"

ggplot(olive_data)+
  geom_point(aes(x=eicosenoic,y=oleic,color=Region_category)) +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Region") +
  theme(plot.title = element_text(hjust = 0.5))

#------------------------------------------------------------------------------

# TASK 1.4

ggplot(olive_data)+
  geom_point(aes(x = oleic, y = eicosenoic, color = cut_interval(linoleic, n = 3),
                            shape = cut_interval(palmitic, n = 3),
                            size = cut_interval(palmitoleic, n = 3))) + 
  scale_size_manual(values = c(2,3,4)) +
  labs(shape = "Palmitic range", color = "Linoleic range", size = "Palmitoleic range") +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Linoleic vs Palmitic vs Palmitoleic") +
  theme(plot.title = element_text(hjust = 0.5))

#------------------------------------------------------------------------------

# TASK 1.5

ggplot(olive_data)+
  geom_point(aes(x=eicosenoic, y=oleic, color=Region_category, 
                 shape=cut_interval(palmitic,n=3), 
                 size=cut_interval(palmitoleic,n=3))) + 
  scale_size_manual(values = c(2,3,4)) +
  labs(shape="Palmitic range", size="Palmitoleic range") +
  ggtitle("Dependence of Oleic vs Eicosenoic vs Region vs Palmitic vs Palmitoleic") +
  theme(plot.title = element_text(hjust = 0.5))

#------------------------------------------------------------------------------

# TASK 1.6

plot_ly(olive_data,labels=~Area,type='pie',textinfo = "none") %>%
  layout(title = "Pie chart of proportion of oils coming from different areas")

#------------------------------------------------------------------------------

# TASK 1.7

ggplot(olive_data)+geom_density_2d(aes(x=eicosenoic, y=linoleic)) +
  ggtitle("Contour plot of Linoleic vs Eicosenoic") +
  theme(plot.title = element_text(hjust = 0.5))

#------------------------------------------------------------------------------

# Task 2.1

baseball_data = read.xlsx("baseball-2016.xlsx", sheetName = "Sheet1")

#------------------------------------------------------------------------------

# Task 2.2

baseball_scaled <- scale(baseball_data[,3:length(baseball_data)])
distance_matrix <- dist(baseball_scaled)
mds_result <- isoMDS(distance_matrix, k=2, p=2)
coords <- mds_result$points
coords_mds <- as.data.frame(coords)
baseball_data_with_mds <- baseball_data
baseball_data_with_mds$MDS_V1 <- coords_mds$V1
baseball_data_with_mds$MDS_V2 <- coords_mds$V2


plot_ly(baseball_data_with_mds, x=~MDS_V1, y=~MDS_V2) %>% 
  add_markers(color=~League, colors = c("blue", "red"), 
              text=baseball_data_with_mds$Team, hoverinfo="text") %>%
  layout(title="Scatterplot of the 2 MDS variables")


#------------------------------------------------------------------------------

# Task 2.3

shp <- Shepard(distance_matrix, coords)
delta <- as.numeric(distance_matrix)
D <- as.numeric(dist(coords))

n <- nrow(coords)
index <- matrix(1:n, nrow=n, ncol=n)
index1 <- as.numeric(index[lower.tri(index)])

n <- nrow(coords)
index <- matrix(1:n, nrow=n, ncol=n, byrow = T)
index2 <- as.numeric(index[lower.tri(index)])

plot_ly()%>%
  add_markers(x=~delta, y=~D, name="Observation pairs", hoverinfo = 'text', 
              text = ~paste('Obj 1: ', 
                            rownames(baseball_data_with_mds)[index1], 
                            '<br> Obj 2: ', 
                            rownames(baseball_data_with_mds)[index2])) %>%
  add_lines(x=~shp$x, y=~shp$yf, name="Isotonic regression line") %>%
  layout(title="Shepard's plot of MDS operation")


#------------------------------------------------------------------------------

# Task 2.4

# Shiny app to compare the scatterplots of each numeric variable vs MDS_V2

# # START SHINY APP--------------------------------------------------------------
# #UI component
# ui <- fluidPage(
#   selectInput(inputId="var_name", label="Choose variable", 
#               choices=colnames(baseball_data_with_mds[,c(-1,-2)])),
#   plotlyOutput("scatterPlot", height = "650px"),
#   verbatimTextOutput("info")
# )
# 
# # Server component
# server <- function(input, output) {
# 
# 
# 
#   output$scatterPlot <- renderPlotly({
#     selected_var = input$var_name
#     fit <- lm(paste(selected_var, "~", "MDS_V2"), data = baseball_data_with_mds)
#     plot_ly(x=baseball_data_with_mds$MDS_V2) %>% 
#       add_markers(name="Data points", 
#                   y=baseball_data_with_mds[[selected_var]]) %>% 
#       add_lines(name="Regression line", 
#                 x=baseball_data_with_mds$MDS_V2,y = fitted(fit)) %>% 
#       layout(title=paste("Scatterplot of", selected_var, "vs MDS_V2"), 
#              yaxis=list(title=selected_var), xaxis=list(title="MDS_V2"))
# 
#   })
# 
#   output$info <- renderText({
#     selected_var = input$var_name
#     fit <- lm(paste(selected_var, "~", "MDS_V2"), data = baseball_data_with_mds)
#     paste0("R2: ", summary(fit)$r.squared, "; Correlation: ", 
#            cor(baseball_data_with_mds$MDS_V2, baseball_data_with_mds[[selected_var]]))
#   })
# }
# 
# # Run the application
# shinyApp(ui = ui, server = server, options = list(width="800px", height="900px"))
# # END SHINY APP----------------------------------------------------------------

# Correlations
# HR.per.game - 0.707, X3B - -0.646, HR - 0.708, SH - -0.591, BAvg - -0.51, IBB - -0.572
# Most correlated - HR, X3B

plot_ly(baseball_data_with_mds) %>% add_markers(x=~MDS_V2, y=~HR) %>%
  layout(title="Scatterplot of Home Runs vs MDS_V2", yaxis=list(title="Home Runs"))
plot_ly(baseball_data_with_mds) %>% add_markers(x=~MDS_V2, y=~X3B) %>%
  layout(title="Scatterplot of Triples vs MDS_V2", yaxis=list(title="Triples"))